Skip to content

feat!: allow non-terminal + segment in read structures#1157

Merged
tfenne merged 3 commits into
mainfrom
tf_non_terminal_plus_in_rs
May 19, 2026
Merged

feat!: allow non-terminal + segment in read structures#1157
tfenne merged 3 commits into
mainfrom
tf_non_terminal_plus_in_rs

Conversation

@tfenne

@tfenne tfenne commented Apr 23, 2026

Copy link
Copy Markdown
Member

The + (indefinite-length) segment can now appear at any position within a read structure, not only the last, mirroring the Rust read-structure crate PR #14 on the Scala side. Extract resolves middle- and leading-+ structures by walking backward from the end of the read for post-+ segments.

Breaking API changes:

  • ReadSegment.offset removed; ReadStructure tracks offsets internally via a private signed array. Constructors now take (length, kind) instead of (offset, length, kind).
  • ReadStructure.apply(segments, resetOffsets) simplified to apply(segments); offsets are always computed.
  • Fixed-length structures reject over-long reads instead of silently truncating. Callers that may see variable-length input should route it through withVariableLastSegment.
  • extract gained an includeSkips: Boolean = false parameter; Skip segments are now omitted from extract results by default. All in-tree callers were already filtering by kind, so no behavior change for them.

ExtractUmisFromBam.updateClippingInformation now walks segments accumulating a running non-template-length sum before the clip position. It requires all but the last segment to be fixed-length, since translating a clip past a non-terminal + would need the read length.

The + segment still means zero-or-more bases here, deliberately differing from the Rust crate's one-or-more; DemuxFastqs relies on zero-or-more to handle short/trimmed reads via
withVariableLastSegment.

The `+` (indefinite-length) segment can now appear at any position
within a read structure, not only the last, mirroring the Rust
`read-structure` crate PR #14 on the Scala side. Extract resolves
middle- and leading-+ structures by walking backward from the end
of the read for post-+ segments.

Breaking API changes:
 - ReadSegment.offset removed; ReadStructure tracks offsets
   internally via a private signed array. Constructors now take
   (length, kind) instead of (offset, length, kind).
 - ReadStructure.apply(segments, resetOffsets) simplified to
   apply(segments); offsets are always computed.
 - Fixed-length structures reject over-long reads instead of
   silently truncating. Callers that may see variable-length
   input should route it through withVariableLastSegment.
 - extract gained an includeSkips: Boolean = false parameter;
   Skip segments are now omitted from extract results by default.
   All in-tree callers were already filtering by kind, so no
   behavior change for them.

ExtractUmisFromBam.updateClippingInformation now walks segments
accumulating a running non-template-length sum before the clip
position. It requires all but the last segment to be fixed-length,
since translating a clip past a non-terminal + would need the
read length.

The + segment still means zero-or-more bases here, deliberately
differing from the Rust crate's one-or-more; DemuxFastqs relies on
zero-or-more to handle short/trimmed reads via
withVariableLastSegment.
@tfenne tfenne requested review from clintval and nh13 as code owners April 23, 2026 22:34
@codecov

codecov Bot commented Apr 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.91304% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.97%. Comparing base (768d38c) to head (f0fed8e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...scala/com/fulcrumgenomics/util/ReadStructure.scala 98.68% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1157      +/-   ##
==========================================
+ Coverage   95.90%   95.97%   +0.06%     
==========================================
  Files         133      133              
  Lines        8327     8390      +63     
  Branches      942     1015      +73     
==========================================
+ Hits         7986     8052      +66     
+ Misses        341      338       -3     
Flag Coverage Δ
unittests 95.97% <98.91%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented Apr 23, 2026

Copy link
Copy Markdown
PR Preview Action v1.6.1

🚀 View preview at
https://fulcrumgenomics.github.io/fgbio/pr-preview/pr-1157/

Built to branch gh-pages at 2026-05-19 21:17 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d12a2a7b-82c3-4027-a009-0327deda8c2a

📥 Commits

Reviewing files that changed from the base of the PR and between a3c7fbe and f0fed8e.

📒 Files selected for processing (3)
  • src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scala
  • src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala
  • src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala
✅ Files skipped from review due to trivial changes (2)
  • src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scala
  • src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala

📝 Walkthrough

Walkthrough

This pull request revises read-structure handling and docs: CLI help now allows at most one '+' segment anywhere; ReadSegment no longer stores offsets and ReadStructure computes per-segment offsets and supports non-terminal '+'; extract(...) gains includeSkips and omits Skip by default; ExtractUmisFromBam recalculates clipping positions against the new segment model; tests expanded to cover boundary and '+' cases.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: allowing non-terminal + segments in read structures, matching the primary objective of the PR.
Description check ✅ Passed The description comprehensively covers the changeset, explaining the feature, breaking API changes, and implementation details related to the modified files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch tf_non_terminal_plus_in_rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala (1)

63-106: Good coverage of the new +-at-any-position rule: multi-+ rejection via both strings and sequences, round-trip through toString, and non-positive-length rejection.

Might also be worth asserting that a non-adjacent multi-+ (e.g. ReadStructure("8B+M10T+S")) rejects, to guard the "at most one" invariant across the whole structure rather than just adjacent cases. Optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala` around lines
63 - 106, Add a test asserting that non-adjacent multiple '+' segments are
rejected to ensure the "at most one + anywhere" invariant; specifically, update
ReadStructureTest (tests around the "accept the + segment at any position, at
most once" block) to include an assertion that ReadStructure("8B+M10T+S") (and
similar non-adjacent multi-+ strings) throws an Exception using the same
an[Exception] shouldBe thrownBy pattern so the parser rejects multiple '+' even
when they are not adjacent.
src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala (1)

430-449: Good boundary coverage — especially the "clip exactly on a segment boundary" cases, which are where the offset < clippingPosition loop guard matters most.

All four expected values check out against the new implementation. Optional: consider also asserting that a non-terminal + with clippingAttribute throws the new require (e.g. ReadStructure("10M+M5T") with XT set), to lock in the precondition.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala` around
lines 430 - 449, Add a negative test asserting that the new precondition (the
require) rejects a non-terminal '+' clipping operator when a clippingAttribute
is present: in ExtractUmisFromBamTest.scala add a case that sets record("XT") to
some value and calls ExtractUmisFromBam.updateClippingInformation(record,
Some("XT"), ReadStructure("10M+M5T")) (or similar with a non-terminal '+') and
assert that the call throws the expected exception; reference
ExtractUmisFromBam.updateClippingInformation, ReadStructure and the clipping
attribute name "XT" when locating where to add the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala`:
- Around line 58-62: The wording uses the jargon "monotemplate" in the paragraph
describing the read-structure example (which shows `5M5S+T`), but that term is
confusing because the example's second five bases are a skip (`5S`); update the
sentence in FastqToBam.scala (the paragraph around the `5M5S+T` example) to say
the second five bases "should be skipped" or "are non-template/skipped" instead
of "monotemplate" so the description matches the `5S` operation.

In `@src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala`:
- Around line 253-301: Defaulting includeSkips to false in the extract overloads
(see extract(bases: String): Seq[SubReadWithoutQuals] and extract(bases: String,
quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals]) is a
silent breaking change for out-of-tree callers who relied on Skip segments;
update the release notes/changelog to call this out as a breaking change and
also update the Scaladoc for both extract methods to explicitly document the new
default behavior and recommend callers pass includeSkips = true if they relied
on Skip entries.

---

Nitpick comments:
In `@src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala`:
- Around line 430-449: Add a negative test asserting that the new precondition
(the require) rejects a non-terminal '+' clipping operator when a
clippingAttribute is present: in ExtractUmisFromBamTest.scala add a case that
sets record("XT") to some value and calls
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"),
ReadStructure("10M+M5T")) (or similar with a non-terminal '+') and assert that
the call throws the expected exception; reference
ExtractUmisFromBam.updateClippingInformation, ReadStructure and the clipping
attribute name "XT" when locating where to add the test.

In `@src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala`:
- Around line 63-106: Add a test asserting that non-adjacent multiple '+'
segments are rejected to ensure the "at most one + anywhere" invariant;
specifically, update ReadStructureTest (tests around the "accept the + segment
at any position, at most once" block) to include an assertion that
ReadStructure("8B+M10T+S") (and similar non-adjacent multi-+ strings) throws an
Exception using the same an[Exception] shouldBe thrownBy pattern so the parser
rejects multiple '+' even when they are not adjacent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9a3c5f08-44b3-4474-9c6f-85f3f3bfff9b

📥 Commits

Reviewing files that changed from the base of the PR and between 768d38c and a3c7fbe.

📒 Files selected for processing (7)
  • src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scala
  • src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala
  • src/main/scala/com/fulcrumgenomics/illumina/RunInfo.scala
  • src/main/scala/com/fulcrumgenomics/umi/ExtractUmisFromBam.scala
  • src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala
  • src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala
  • src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala

Comment thread src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala
Comment on lines +253 to 301
/** Extracts the bases corresponding to each read segment. [[SegmentType.Skip]] segments are omitted
* from the returned sequence because callers almost always throw those bases away; use the two-argument
* overload with `includeSkips = true` to keep them. */
def extract(bases: String): Seq[SubReadWithoutQuals] = extract(bases, includeSkips = false)

/** Extracts the bases corresponding to each read segment.
*
* @param bases the raw read bases
* @param includeSkips whether to include [[SegmentType.Skip]] segments in the returned sequence
*/
def extract(bases: String, includeSkips: Boolean): Seq[SubReadWithoutQuals] = {
validateReadLength(bases.length)
val readLen = bases.length
val builder = IndexedSeq.newBuilder[SubReadWithoutQuals]
var i = 0
while (i < segments.length) {
val seg = segments(i)
if (includeSkips || seg.kind != SegmentType.Skip) {
val (start, end) = spanOf(i, readLen)
builder += SubReadWithoutQuals(bases.substring(start, end), seg)
}
i += 1
}
builder.result()
}

/** Extracts the bases and qualities corresponding to each read segment.
*
* @param bases the raw read bases
* @param quals the raw read qualities; must be the same length as `bases`
* @param includeSkips whether to include [[SegmentType.Skip]] segments in the returned sequence;
* defaults to `false` because callers almost always throw those bases away.
*/
def extract(bases: String, quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals] = {
require(bases.length == quals.length, s"Bases and quals differ in length: ${bases.length} vs ${quals.length}")
validateReadLength(bases.length)
val readLen = bases.length
val builder = IndexedSeq.newBuilder[SubReadWithQuals]
var i = 0
while (i < segments.length) {
val seg = segments(i)
if (includeSkips || seg.kind != SegmentType.Skip) {
val (start, end) = spanOf(i, readLen)
builder += SubReadWithQuals(bases.substring(start, end), quals.substring(start, end), seg)
}
i += 1
}
builder.result()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Skip-exclusion default is a silent semantic change — worth a release-notes mention.

In-tree callers (DemuxFastqs, FastqToBam, ExtractUmisFromBam) all post-filter by kind, so this is safe here. Out-of-tree callers that previously iterated extract(bases) without filtering will silently lose Skip entries. Given feat!, that's acceptable — just call it out in the changelog alongside the other breaking items.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala` around lines 253
- 301, Defaulting includeSkips to false in the extract overloads (see
extract(bases: String): Seq[SubReadWithoutQuals] and extract(bases: String,
quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals]) is a
silent breaking change for out-of-tree callers who relied on Skip segments;
update the release notes/changelog to call this out as a breaking change and
also update the Scaladoc for both extract methods to explicitly document the new
default behavior and recommend callers pass includeSkips = true if they relied
on Skip entries.

@clintval clintval left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the Scala implementation allow a zero-length + operator but the Rust implementation does not? Would you be willing to update our wiki docs on read structures here too?

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

Comment on lines 250 to 251
|molecular identifiers will be concatenated using
|the `-` delimiter and placed in the given SAM record tag (`RX` by default). Similarly, the sample barcode bases

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line lengths are a bit broken here.

Should render fine, but looks odd in the source code.

Comment on lines +431 to +449
// molecular-barcode prefix that gets stripped: a clip past the 10M should shift down by 10
record("XT") = 50
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M90T"))
record[Int]("XT") shouldBe 40

// clip position exactly on a segment boundary (end of the skip)
record("XT") = 30
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("20T10S70T"))
record[Int]("XT") shouldBe 20

// clip position exactly on the next segment's start when that segment is a template
record("XT") = 20
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M10T80T"))
record[Int]("XT") shouldBe 10

// multiple template segments with a mol barcode between them; clip lands in the second template
record("XT") = 40
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("20T10M70T"))
record[Int]("XT") shouldBe 30

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decent tests, but did you mean to add any read segments with the + operator before/after the UMI segment to flex the new functionality?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a positive test for a trailing +T, but the tool actually explicitly checks for and forbids a + segment in the non terminal position (separate and apart from the read-structure parsing), so I've added a test that that check still fires.

Throughout this PR I've avoided changing behavior other than the RS parsing. I think we could entertain separate PRs to loosen restrictions in tools where it makes sense.

@tfenne

tfenne commented May 19, 2026

Copy link
Copy Markdown
Member Author

Why does the Scala implementation allow a zero-length + operator but the Rust implementation does not? Would you be willing to update our wiki docs on read structures here too?

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

Honestly, I think it should be that + is one or more, but there is behavior in DemuxFastqs that relies on it being 0-or-more. I didn't want to break that, so left it be for now. Individual tools can always assert that each segment has a length post-extraction.

I will update the wiki docs once this PR is merged.

Add a positive regression case for a terminal + segment with a clipping
attribute, and a negative case asserting the non-terminal + precondition
fires. Also reflow the read-structure paragraph in DemuxFastqs so the
source no longer wraps mid-clause. Both per PR #1157 review feedback.
@tfenne tfenne merged commit 21470e2 into main May 19, 2026
13 checks passed
@tfenne tfenne deleted the tf_non_terminal_plus_in_rs branch May 19, 2026 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants